-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to derive variables and add selected derived forcings #34
base: main
Are you sure you want to change the base?
Conversation
- Update the configuration file so that we list the dependencies and the method used to calculate the derived variable instead of having a flag to say that the variables should be derived. This approach is temporary and might be revised soon. - Add a new class in mllam_data_prep/config.py for derived variables to distinguish them from non-derived variables. - Updates to mllam_data_prep/ops/loading.py to distinguish between derived and non-derived variables. - Move all functions related to forcing derivations to a new and renamed function (mllam_data_prep/ops/forcings.py).
…eck the attributes of the derived variable data-array
…derived_variables'
I think this is a good idea and have actually done that. So I have added a However, I have not used error handling a lot so I just added the Exception raises to |
mllam_data_prep/derived_variables.py
Outdated
|
||
# Get module and function names | ||
function_namespace_list = function_namespace.rsplit(".") | ||
if len(function_namespace_list) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can actually avoid this whole if-statement if you do
module_name, _, function_name = function_namespace.rpartition(".")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didnt know of .rpartition()
on strings before, cool!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using globals()
which would require that the module is already imported I think it might be better to use the importlib
package.
I think you would write:
module = importlib.import_module(module_name)
fn = getattr(module, function_name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if
loop is because if the module name is empty, i.e. the functions are included in the same module as the rest of the workflow for deriving variables (which they are at the moment), then the use of the importlib
package doesn't work. Also, I didn't think it made sense to import the mllam_data_prep.ops.derived_variables
since this is the module the call is being made from so that is why I added the globals()
part for these 2 cases. For all other cases (the else
statement) I am using the importlib
package approach already.
However, since we're anyway splitting up the functions from the rest (in one way or another) this approach will need to change as well and then I will just go with the importlib
package solution, as I have for the else
statement part here.
Very nice with the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a few more suggestions. Hope it makes sense. Looking good! 🚀
mllam_data_prep/derived_variables.py
Outdated
try: | ||
ds = xr.open_zarr(fp) | ||
except ValueError: | ||
ds = xr.open_dataset(fp) | ||
|
||
ds_subset = xr.Dataset() | ||
ds_subset.attrs.update(ds.attrs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that would be better too. So maybe the "load" part of load_and_subset_dataset
should be split out into its own function so that we have load_input_dataset
which is called once in create_dataset
in https://github.com/ealerskans/mllam-data-prep/blob/feature/derive_forcings/mllam_data_prep/create_dataset.py#L137. And then rename the current load_and_subset_dataset
function just subset_dataset
. When calling load_input_dataset
, maybe call what is returned ds_input
and that can then be passed into both subset_dataset
and derive_variables
mllam_data_prep/derived_variables.py
Outdated
Dataset with chunking applied | ||
""" | ||
# Define the memory limit check | ||
memory_limit_check = 1 * 1024**3 # 1 GB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I would put this constant into a global variable, something like CHUNK_MAX_SIZE_WARNING
, but I would also put this "chunk size check into its own function somewhere outside of ops
, maybe mllam_data_prep.chunking
, or maybe mllam_data_prep.ops.chunking
makes sense? I think this is something we might want to reuse :)
mllam_data_prep/derived_variables.py
Outdated
|
||
# Get module and function names | ||
function_namespace_list = function_namespace.rsplit(".") | ||
if len(function_namespace_list) > 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using globals()
which would require that the module is already imported I think it might be better to use the importlib
package.
I think you would write:
module = importlib.import_module(module_name)
fn = getattr(module, function_name)
mllam_data_prep/derived_variables.py
Outdated
return day_of_year_cos, day_of_year_sin | ||
|
||
|
||
def cyclic_encoding(data, data_max): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is getting quite long. Should we maybe split the implementation of the individual functions into submodules? I also think this should maybe sit in mllam_data_prep.ops
instead of in the root of the package.
We could maybe do something like
mllam_data_prep
.ops
.derive_variable
.__init__ <-- you could import the functions that actually carry out the work from the modules mentioned below:
.dispatcher <-- this would contain the function to derive single variable
.physical_field <-- this could contain the toa field for example
.time_components <-- the time components could go here
Does this makes sense? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it is getting long and it would be nice to split it up. The main reason for not adding it to mllam_data_prep.ops
is that I don't know what "ops" stands for and what can/should be included here.
The idea and structure makes sense but the implementation is not very clear, probably because I am not very familiar with object oriented programming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it is mostly in connection with the other suggestion that we should loop over the derived variables in a call external to derive_variable
(in create_dataset()
I am assuming then, or somewhere else?) and what should be called from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have also been a bit puzzeled about what the "ops" stands for. I think I have interpreted is as "operations", but not sure if this is correct. Do we document the meaning somewhere? Otherwise, we could maybe consider to rename it to make it more self-explainable (for another PR then).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was meant to be shorthand for "operations", sorry... regarding splitting this file it isn't so much object-oriented programming, but more about just splitting functions into separate modules that have different purposes. So that would mean making a new subdirectory (called derived_variables
) and creating a new python file (which will then each become a new submodule) and put the different functions in there.
But let's talk about it and I'll show you what I mean :)
…doesn't have all dimensions. This way we don't need to broadcast these variables explicitly to all dimensions.
dataset - Output dataset is created in 'create_dataset' instead of in the 'subset_dataset' and 'derive_variables' functions. - Rename dataset variables to make it clearer what they are and also make them more consistent between 'subset_dataset' and 'derive_variables'. - Add function for aligning the derived variables to the correct output dimensions. - Move the 'derived_variables' from their own dataset in the example config file to the 'danra_surface' dataset, as it is now possible to combine them.
…hat either 'variables' or 'derived_variables' are included and that if both are included, they don't contain the same variable names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really looking great! I will find a time in our calendar tomorrow so we can have a look together :)
@@ -301,6 +328,54 @@ class Config(dataclass_wizard.JSONWizard, dataclass_wizard.YAMLWizard): | |||
class _(JSONWizard.Meta): | |||
raise_on_unknown_json_key = True | |||
|
|||
@staticmethod | |||
def load_config(*args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great that you've put this validation code all together. Instead of creating a new method called load_config
I think it would be better to use the __post_init__()
method (https://docs.python.org/3/library/dataclasses.html#dataclasses.__post_init__) that dataclasses can implement though. I would then make your validation code here be a function that isn't a member of the Config
class but just put the global scope of this module (you could maybe call i validate_config
?). Using __post_init__
will ensure this method is automatically called whatever the dataclass instance is initialised from (be it a yaml file, yaml formatted string, json or directly as a Config
object)
@@ -103,7 +103,7 @@ The package can also be used as a python module to create datasets directly, for | |||
import mllam_data_prep as mdp | |||
|
|||
config_path = "example.danra.yaml" | |||
config = mdp.Config.from_yaml_file(config_path) | |||
config = mdp.Config.load_config(config_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you make the change I suggested to use __post_init__
on the Config
dataclass then you can still use from_yaml_file
here which makes it clear we are just using that method as implemented by dataclasses-wizard
:)
@@ -175,6 +175,18 @@ inputs: | |||
variables: | |||
# use surface incoming shortwave radiation as forcing | |||
- swavr0m | |||
derived_variables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
beautiful!
@@ -120,23 +125,50 @@ def create_dataset(config: Config): | |||
|
|||
output_config = config.output | |||
output_coord_ranges = output_config.coord_ranges | |||
chunking_config = config.output.chunking | |||
|
|||
dataarrays_by_target = defaultdict(list) | |||
|
|||
for dataset_name, input_config in config.inputs.items(): | |||
path = input_config.path | |||
variables = input_config.variables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should rename this selected_variables
? What do you think? It is a bit confusing to have both variables
and derived_variables
, maybe calling it selected_variables
will make it clear that this part of the code deals with the ones that are "selected" from the input dataset
@@ -255,7 +286,7 @@ def create_dataset_zarr(fp_config, fp_zarr: str = None): | |||
The path to the zarr file to write the dataset to. If not provided, the zarr file will be written | |||
to the same directory as the config file with the extension changed to '.zarr'. | |||
""" | |||
config = Config.from_yaml_file(file=fp_config) | |||
config = Config.load_config(file=fp_config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here for the .from_yaml_file
call :)
# Where TOA radiation is negative, set to 0 | ||
toa_radiation = xr.where(solar_constant * cos_sza < 0, 0, solar_constant * cos_sza) | ||
|
||
if isinstance(toa_radiation, xr.DataArray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect!
|
||
Returns | ||
------- | ||
hour_of_day_cos: Union[xr.DataArray, float] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be simpler (and more explicit which is often good) to change the config to explicitly call these features hour_of_day_cos
and hour_of_day_sin
, and that way each function only every returns one derived variable (which makes the rest of the code simpler I think)
Sine part of cyclically encoded input data | ||
""" | ||
|
||
data_sin = np.sin((data / data_max) * 2 * np.pi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could split this into two functions or make cos
or sin
(as strings) an argument to this function say cyclically_encode_values(values, max_value, component='cos')
. But inline might be better to make this more explicit (since it is only one line)
chunks = { | ||
dim: chunking.get(dim, int(ds_subset[dim].count())) for dim in ds_subset.dims | ||
} | ||
ds_subset = ds_subset.chunk(chunks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the chunking is done here already (during load), why do we need to do it again later when deriving variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand the mocking, patching and tests in here :) but let's talk about it so I can learn!
Implements the ability to derive fields from the input datasets, as discussed in Deriving forcings #29.
At the moment, I have only added the possibility to derive the following forcings:
time of year (cyclically encoded)But additional variables, such as boundary and land-sea masks, should be added. But I think that is for another PR.